Today we will…
Functions allow you to automate common tasks!
Writing functions has 3 big advantages over copy-paste:
Let’s define the function.
add_two <-
The name of the function is chosen by the author.
The argument(s) of the function are chosen by the author.
If we supply a default value when defining the function, the argument is optional when calling the function.
something defaults to 2.
{ }
The body of the function is where the action happens.
return()
Your function will give back what would normally print out…
If you need to return more than one object from a function, wrap those objects in a list.
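The pieces described above can be put together in a short sketch. The function body and the helper sum_and_product() are illustrative assumptions, not code taken from the slides; only the idea of a default value of 2 for something comes from the text.

```r
# Hypothetical example: `something` has a default value of 2,
# so it is optional when calling the function.
add_something <- function(x, something = 2) {
  return(x + something)
}

add_something(x = 5)                  # uses the default: 5 + 2
add_something(x = 5, something = 10)  # overrides the default: 5 + 10

# Returning more than one object: wrap the objects in a named list.
sum_and_product <- function(x, y) {
  return(list(sum = x + y, product = x * y))
}

results <- sum_and_product(3, 4)
results$sum      # 7
results$product  # 12
```

Naming the list elements lets the caller pull out each result with $ instead of remembering positions.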
When a function requires an input of a specific data type, check that the supplied argument is valid.
add_something <- function(x, something){
  if(!is.numeric(x)){
    stop("Please provide a numeric input for the x argument.")
  }
  return(x + something)
}
add_something(x = "statistics", something = 5)
Error in add_something(x = "statistics", something = 5): Please provide a numeric input for the x argument.
add_something <- function(x, something){
  if(!is.numeric(x) || !is.numeric(something)){
    stop("Please provide numeric inputs for both arguments.")
  }
  return(x + something)
}
add_something(x = 2, something = "R")
Error in add_something(x = 2, something = "R"): Please provide numeric inputs for both arguments.
The location (environment) in which we can find and access a variable is called its scope.
We cannot access variables created inside a function outside of the function.
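A minimal sketch of this scoping rule, using a hypothetical function and variable name (my_result is an assumption for illustration):

```r
add_two <- function(x) {
  my_result <- x + 2   # my_result exists only while the function runs
  return(my_result)
}

add_two(9)

# my_result was created inside the function, so it is not visible out here.
exists("my_result")
```

The value comes back through return(), but the name my_result never appears in the global environment.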
Name masking occurs when an object in the function environment has the same name as an object in the global environment.
Functions look for objects FIRST in the function environment and SECOND in the global environment.
It is not good practice to rely on global environment objects inside a function!
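The following sketch illustrates name masking and why relying on global objects is fragile. All names here (y, add_local_y(), add_global_y()) are hypothetical and chosen only for the illustration.

```r
y <- 100  # lives in the global environment

# Name masking: the local y hides the global y inside the function.
add_local_y <- function(x) {
  y <- 1
  return(x + y)
}

# Relies on whatever y happens to be in the global environment.
add_global_y <- function(x) {
  return(x + y)
}

add_local_y(2)   # 3: uses the local y
add_global_y(2)  # 102: falls back to the global y

y <- 0
add_global_y(2)  # 2: same call, different answer
```

The last two calls to add_global_y() are identical, yet the result changes because the global y changed. Passing y as an argument would avoid this.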
You will make mistakes (create bugs) when coding.
print() debugging
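As a small illustration, here is a hypothetical function with print() statements added at each intermediate step (the function itself is an assumption, not from the slides):

```r
# Hypothetical function with print() statements added while debugging.
average_of_positives <- function(x) {
  positives <- x[x > 0]
  print(positives)   # check: were the right values kept?

  total <- sum(positives)
  print(total)       # check: is the sum what we expect?

  return(total / length(positives))
}

average_of_positives(c(-1, 2, 4))
```

Once the function behaves as expected, the print() calls can be removed.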
Add print() statements throughout your code to make sure the values are what you expect.
When you have a concept that you want to turn into a function…
Write a simple example of the code without the function framework.
Generalize the example by assigning variables.
Write the code into a function.
Call the function on the desired arguments.
This structure allows you to address issues as you go.
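The four steps above can be sketched with a small, hypothetical task: counting the missing values in a vector. The function name count_missing() is an assumption for this illustration.

```r
# 1. Write a simple example of the code without the function framework.
sum(is.na(c(4, NA, 7, NA)))

# 2. Generalize the example by assigning a variable.
x <- c(4, NA, 7, NA)
sum(is.na(x))

# 3. Write the code into a function.
count_missing <- function(x) {
  return(sum(is.na(x)))
}

# 4. Call the function on the desired arguments.
count_missing(c(4, NA, 7, NA))
count_missing(c("a", "b", NA))
```

If step 3 gives a different answer than step 2, the problem is in the function wrapper rather than in the underlying code.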
Write a function called find_car_make() that takes in the name of a car and returns the “make” of the car (the company that created it).
find_car_make("Toyota Camry") should return “Toyota”.
find_car_make("Ford Anglia") should return “Ford”.
You will write a few small functions and use them to unscramble a message!
Today we will…
Check out the Canvas page outlining the group project!
We wrote a function called find_car_make() that takes in the name of a car and returns the “make” of the car (the company that created it).
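One possible version of this function is sketched below. It assumes the make is always the first word of the car name, which is an assumption of this sketch and not necessarily how the in-class solution was written.

```r
# A possible implementation, assuming the make is the first word of the name.
find_car_make <- function(car_name) {
  stopifnot(is.character(car_name))
  # Drop the first space and everything after it, leaving the first word.
  make <- sub(" .*", "", car_name)
  return(make)
}

find_car_make("Toyota Camry")
find_car_make(c("Mazda RX4 Wag", "Datsun 710"))
```

Because sub() is vectorized, the function accepts a whole column of names at once, which is what allows it to be used inside mutate().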
find_car_make("Toyota Camry") returns “Toyota”.
find_car_make("Ford Anglia") returns “Ford”.
dplyr
Consider the mtcars data.
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Let’s use our new function:
mtcars |>
  rownames_to_column("make_model") |>
  mutate(make = find_car_make(make_model),
         .after = make_model) |>
  head(n = 3)
     make_model   make  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1     Mazda RX4  Mazda 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag  Mazda 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3    Datsun 710 Datsun 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
penguins Data
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
We want to take in a vector of numbers and standardize it: rescale the values so that they all fall between 0 and 1.
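A minimal sketch of such a function is shown below. The name std_to_01() is the one used later in these slides, but its definition is not shown there, so the min-max formula and the choice to ignore missing values with na.rm = TRUE are assumptions of this sketch.

```r
# Assumed min-max rescaling: the smallest value maps to 0, the largest to 1.
std_to_01 <- function(x) {
  stopifnot(is.numeric(x))
  x_min <- min(x, na.rm = TRUE)   # ignore missing values when finding the range
  x_max <- max(x, na.rm = TRUE)
  return((x - x_min) / (x_max - x_min))
}

std_to_01(c(10, 15, 20, NA))
```

Missing values stay missing in the output; they are only skipped when computing the minimum and maximum.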
Is it a good idea to standardize (scale) variables in a data analysis?
Why standardize?
Why not standardize?
E.g., a penguin with a bill length of 35 mm (std to 0.11) and a mass of 5500 g (std to 0.78).
dplyr
Let’s standardize penguin measurements.
Recall across()!
penguins |>
  mutate(across(.cols = bill_length_mm:body_mass_g,
                .fns = ~ std_to_01(.x))) |>
  slice_head(n = 4)
# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen          0.255         0.667             0.153       0.292
2 Adelie  Torgersen          0.269         0.512             0.237       0.306
3 Adelie  Torgersen          0.298         0.583             0.390       0.153
4 Adelie  Torgersen         NA            NA                NA          NA
# ℹ 2 more variables: sex <fct>, year <int>
Note
I used the existing function std_to_01() inside the new function for clarity!
Functions using unquoted variable names as arguments are said to use nonstandard evaluation or tidy evaluation.
Tidy evaluation isn’t naturally supported when writing your own functions.
When a piece of code is defused, R doesn’t return its value like normal.
We produce defused code when we use tidy evaluation and our own functions don’t know how to handle it.
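To see what defused code looks like, here is a small sketch using base R's quote(), which plays a role similar to rlang's expr(). Using quote() as a stand-in is a choice made for this illustration so that it runs without extra packages.

```r
x <- 10

# Normal evaluation: R computes the value right away.
x + 1

# Defused code: R stores the expression itself instead of its value.
defused <- quote(x + 1)
defused          # shows the expression x + 1, not 11
class(defused)   # "call"

# The defused expression can be evaluated later.
eval(defused)
```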
Don’t use tidy evaluation in your own functions.
rlang
Use the rlang package! It provides tools for working with tidy evaluation in tidyverse pipelines.
Two ways to get around the issue of defused code:
Embrace the variable: with {{ }}, you can transport a variable from one function to another.
Defuse and inject the variable: use enquo(arg) to defuse the variable, then use !!arg to inject the variable.
If we use either of these solutions, we also need to use the walrus operator (:=).
Use := instead of = in any dplyr verb where the name on the left-hand side comes from one of these rlang fixes.
std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  data <- data |>
    mutate(variable = std_to_01(variable))
  return(data)
}
std_column_01(penguins, body_mass_g)
Error in `mutate()`:
ℹ In argument: `variable = std_to_01(variable)`.
Caused by error:
! object 'body_mass_g' not found
mutate() doesn’t know what body_mass_g is.
Apply one of the rlang fixes to variable to make this work correctly!
# A tibble: 6 × 7
  species island    bill_length_mm bill_depth_mm body_mass_g sex     year
  <fct>   <fct>              <dbl>         <dbl>       <dbl> <fct>  <int>
1 Adelie  Torgersen           39.1          18.7       0.292 male    2007
2 Adelie  Torgersen           39.5          17.4       0.306 female  2007
3 Adelie  Torgersen           40.3          18         0.153 female  2007
4 Adelie  Torgersen           NA            NA        NA     <NA>    2007
5 Adelie  Torgersen           36.7          19.3       0.208 female  2007
6 Adelie  Torgersen           39.3          20.6       0.264 male    2007
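The two rlang approaches could be sketched as follows. The toy data frame, the name std_column_01_inject(), and the min-max definition of std_to_01() are assumptions made so the example is self-contained; the slides run the function on penguins instead.

```r
library(dplyr)  # dplyr re-exports enquo() from rlang

# Assumed min-max rescaling helper (definition not shown in the slides).
std_to_01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Option 1: embrace the variable with {{ }}.
std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  data |>
    mutate({{ variable }} := std_to_01({{ variable }}))
}

# Option 2: defuse with enquo(), then inject with !!.
std_column_01_inject <- function(data, variable) {
  stopifnot(is.data.frame(data))
  variable <- enquo(variable)
  data |>
    mutate(!!variable := std_to_01(!!variable))
}

# Hypothetical toy data, used here in place of penguins.
toy <- data.frame(id = 1:3, mass = c(10, 20, 30))

std_column_01(toy, mass)
std_column_01_inject(toy, mass)
```

Both versions give the same result. Note the := on the left-hand side: a plain = would create a column literally named variable.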
What if I want to modify multiple columns?
Use across()!
# A tibble: 5 × 7
  species island    bill_length_mm bill_depth_mm body_mass_g sex     year
  <fct>   <fct>              <dbl>         <dbl>       <dbl> <fct>  <int>
1 Adelie  Torgersen          0.255         0.667       0.292 male    2007
2 Adelie  Torgersen          0.269         0.512       0.306 female  2007
3 Adelie  Torgersen          0.298         0.583       0.153 female  2007
4 Adelie  Torgersen         NA            NA          NA     <NA>    2007
5 Adelie  Torgersen          0.167         0.738       0.208 female  2007
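A function that standardizes several columns at once might look like the sketch below. The name std_columns_01(), the toy data frame, and the min-max definition of std_to_01() are assumptions for illustration; the code that produced the output above is not shown in the slides.

```r
library(dplyr)

# Assumed min-max rescaling helper (definition not shown in the slides).
std_to_01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Embrace the column selection and hand it to across().
std_columns_01 <- function(data, variables) {
  stopifnot(is.data.frame(data))
  data |>
    mutate(across(.cols = {{ variables }}, .fns = std_to_01))
}

# Hypothetical toy data, used here in place of penguins.
toy <- data.frame(id = 1:3, height = c(1, 2, 3), mass = c(10, 20, 30))

std_columns_01(toy, c(height, mass))
std_columns_01(toy, height:mass)
```

Because across() keeps the original column names, no := is needed in this version.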
Consider a study of depression.
We implicitly assume observations are missing completely at random!
We need to take more care when dealing with missing values!